Bayesian Modeling of Panel Data with R Package dynamite
useR! 2024
Santtu Tikka (joint work with Jouni Helske)
University of Jyväskylä
Panel Data
Data from multiple individuals \(i = 1,\ldots,N\)
Persons, countries, firms, …
Measured over multiple time points \(t = 1,\ldots,T\).
Can be calendar time, age, or some other variable defining the order of the observations.
We focus on cases with relatively large \(T\) (e.g., \(T = 100\)), sometimes referred to as intensive longitudinal data.
Multiple measurements per individual \(c = 1,\ldots,C\)
Person’s yearly income, health status, and partnership status, place of residence, date of birth, …
Often used for observational causal inference.
Statistical Methods
Several modelling approaches, depending on the observed and assumed characteristics of the data and the underlying causal graph.
Often based on some variant of cross-lagged panel model (CLPM).
e.g., Allison et al. 2017, Hamaker et al. 2015, Asparouhov et al. 2018
Estimation is typically based on structural equation models (SEM) using maximum likelihood.
Models are sometimes formulated on a latent level instead of the observational level
This can make causal interpretations challenging
Dynamic Multivariate Panel Models
The dynamic multivariate panel model (DMPM), implemented in R package dynamite, supports
Multiple response variables with various distributions, e.g., Gaussian, Poisson, binomial, categorical, …
Time-varying and time-invariant effects
Time-varying effects are defined using Bayesian P-splines
Subject-specific random effects
Latent factors
Fully Bayesian estimation via Stan
Also provides us with posterior distributions of causal effects
Both rstan and cmdstanr are supported
Models are defined on the observational level instead of the latent level
Employment Trajectories
As an illustration, we study a subset of the data from Swiss Household Panel (Tillmann et al. 2016), obtained from the R package march (Maitre et al. 2020)
The data consist of the employment trajectories of 845 individuals (421 women and 424 men).
The employment response variable fulltime was observed every two years from age of 20 to the age of 44 (i.e., 13 time points)
coded as being either 1 (full-time employee) or 0 (other).
We assume that the employment status depends on the full work history,
simplified here to an indicator for if the person has ever worked full time or not (ever)
For simplicity, we consider only one time-invariant covariate:
gender coded as 0 = woman, 1 = man (time-invariant)
Preparing the Data
First, we prepare the data for analysis
library("dplyr")library("tibble")library("march")library("dynamite")N <- Employment.2@Nd <-data.frame(employment =factor(c(Employment.2@yRaw), labels =c("Full-time", "Other")),gender =factor(Employment.2@cov[,,1], labels =c("Woman", "Man")),id =1:N, age =rep(seq(20, 44, by =2), each = N)) |>mutate(fulltime =as.integer(employment =="Full-time"))
Model Structure
The graph above depicts our assumed model structure (causal diagram)
\(y_{t}\) is the employment status at time \(t\) (data variable fulltime)
\(h_{t}\) is the work history status at time \(t\) (data variable ever)
dynamite tries to select suitable priors for the model parameters
get_priors() gives the default priors as a data frame
Priors can be customized by replacing the priors we want by entering a suitable distribution in the prior column
The prior is given in Stan syntax and should be supported by Stan
Fitting the Model
fit <-dynamite( model_formula, data = d, group ="id", time ="age", priors = priors,chains =4, cores =4, iter =5000, refresh =0)
We fit the DMPM with dynamite()
group and time define the columns of data that identify individuals and time points, respectively
priors (optional) defines the priors (if not default)
chains, cores, iter and refresh are arguments to Stan (default backend is rstan)
We compute 4 parallel chains with 5000 posterior draws each
Summarizing the Model Fit
The default print() method provides a summary of the model fit including some diagnostics
print(fit)
Model:
Family
fulltime bernoulli
ever deterministic
Formula
fulltime fulltime ~ -1 + gender:lag(ever) + varying(~-1 + gender + gender:lag(fulltime))
ever numeric(ever) ~ fulltime == 1 | lag(ever) == 1 | init(0)
Data: d (Number of observations: 10140)
Grouping variable: id (Number of groups: 845)
Time index variable: age (Number of time points: 13)
NUTS sampler diagnostics:
No divergences, saturated max treedepths or low E-BFMIs.
Summarizing the Model Fit
Smallest bulk-ESS: 6898 (tau_fulltime_genderWoman)
Smallest tail-ESS: 5691 (tau_fulltime_genderWoman:fulltime_lag1)
Largest Rhat: 1.001 (beta_fulltime_genderWoman:ever_lag1)
Elapsed time (seconds) for fastest and slowest chains:
warmup sample
chain:2 131.777 161.224
chain:1 130.105 166.147
Summary statistics of the time- and group-invariant parameters:
# A tibble: 6 × 10
variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 beta_fulltime_… 0.575 0.577 0.221 0.222 0.204 0.937 1.00 12463. 8182.
2 beta_fulltime_… 0.0677 0.0684 0.230 0.230 -0.308 0.448 1.00 11623. 7958.
3 tau_fulltime_g… 1.13 1.06 0.388 0.356 0.619 1.86 1.00 6954. 7094.
Other functions to query the model fit include:
summary(): provides summary statistics of all model parameters
coef(): provides summary statistics of regression coefficients (time-invariant and time-varying)
mcmc_diagnostics(): computes effective sample sizes and convergence diagnostics
Visualizing the Model Fit
The default plot() method visualizes the posterior distributions of the model parameters
plot(fit)
Causal Inference with DMPMs
Our approach to causal inference is based on structural causal models (SCM, Pearl, 2009)
Causal effects are defined through interventions; external changes to the data-generating process
Notation: \(P(Y = y|do(X = x))\) is the causal effect of \(X\) on \(Y\)
We are interested on how an intervention on employment at the age of 30 affects the future employment status.
We compare two interventions: forcing everyone to work full time vs. forcing everyone to not work full time at the age of 30.
Assuming that the model structure of the DMPM depicts the true data-generating process, we can obtain the posterior distribution of the causal effect (interventional distribution)
We can then assess the magnitude and uncertainty of these effects
Employment Interventions
We can use the backdoor adjustment to compute the causal effect of employment at age \(t\) on employment at any future age \(t+k\)
\(P(y_{t+j},h_{t+j}|y_{t+j-1},h_{t+j-1},g)\): Our model fit is exactly this!
\(P(h_t,g)\): We can use the empirical distribution of the data instead of explicit modeling
Implementing the Employment Interventions
We fix the employment status (fulltime) for everyone at age 30 while keeping the values of other variables the same as the data
We set the future values to NA as we will predict these with dynamite to obtain the counterfactual values
Finally, we predict starting from age 30 to obtain the causal effects at each age
# No full time employment at age 30newdata0 <- d |>filter(age >=28)newdata0$fulltime[newdata0$age ==30] <-0newdata0$fulltime[newdata0$age >30] <-NA# Full time employment at age 30newdata1 <- d |>filter(age >=28)newdata1$fulltime[newdata1$age ==30] <-1newdata1$fulltime[newdata1$age >30] <-NA
Employment Interventions
We compute the predictions using predict() and contrast the two interventions
Helske J, Tikka S (2024). Estimating Causal Effects from Panel Data with Dynamic Multivariate Panel Models. Advances in Life Course Research, 60, 100617
Allison PD, Williams R, Moral-Benito E (2017). Maximum Likelihood for Cross-Lagged Panel Models with Fixed Effects. Socius, 3.
Asparouhov T, Hamaker EL, Muthén B (2018). Dynamic Structural Equation Models. Structural Equation Modeling: A Multidisciplinary Journal, 25(3), 359–388.
Hamaker EL, Kuiper RM, Grasman RP (2015). A Critique of the Cross-Lagged Panel Model. Psychological methods, 20(1), 102.
Maitre O, Emery K, Buschor O, Berchtold A (2020). march: Markov Chains. R package version 3.3.2.
Pearl J (2009). Causality: Models, Reasoning and Inference. 2nd Edition. Cambridge University Press